WEEK 4: DATA JOINS AND TRANSFORMATIONS

Tuesday, January 30th

Today we will…

  • Review Lab 3: Teaching Evaluations
  • New Material
    • Lambda functions
    • Factor Variables
    • Pivoting with tidyr
    • Joining with dplyr
  • PA 4.1: Military Spending

Lambda Functions

We know how to use across() to apply built-in functions across many columns.

cereal |> 
  summarize(across(calories:potass, mean))
  calories  protein      fat   sodium    fiber   carbo   sugars   potass
1 106.8831 2.545455 1.012987 159.6753 2.151948 14.5974 6.922078 96.07792

But what if we want to pass additional arguments to a function?

  • E.g., we need to drop missing data (na.rm = T).
  • E.g., we need to calculate a trimmed mean (trim = 0.1).

Lambda Functions

When you want to specify additional arguments, you need to use a lambda function rather than just the function name:

~ <FUN_NAME>(.x, <ARGS>)

Calculate trimmed means, trimming 10% of the data on each end.

cereal |> 
  summarize(across(calories:potass, ~ mean(.x, trim = 0.1)))
  calories protein       fat  sodium    fiber    carbo   sugars   potass
1 107.3016 2.47619 0.8888889 163.254 1.773016 14.71429 6.920635 86.19048

Factor Variables

What is a factor variable?

Factors are used for

  1. categorical variables with a fixed and known set of possible values.
  • E.g., day_born = Sunday, Monday, Tuesday, …, Saturday
  1. displaying character vectors in non-alphabetical order.

Creating a Factor – Base R

word_data <- tibble(word = c("apple", "dog", "banana", "cat",
                             "banana", "Queen Elizabeth", "dog")
           )
word_data
# A tibble: 7 × 1
  word           
  <chr>          
1 apple          
2 dog            
3 banana         
4 cat            
5 banana         
6 Queen Elizabeth
7 dog            


factor(word_data$word)
[1] apple           dog             banana          cat            
[5] banana          Queen Elizabeth dog            
Levels: apple banana cat dog Queen Elizabeth


What happened??

Creating a Factor – Base R

When you create a factor variable from a vector…

  • Every unique element in the vector becomes a level.
  • The levels are ordered alphabetically.
  • The elements are no longer displayed in quotes.

Creating a Factor – Base R

You can specify the order of the levels with the level argument.

factor(word_data$word, levels = c("banana","apple","cat",
                             "dog", "Queen Elizabeth")
       )
[1] apple           dog             banana          cat            
[5] banana          Queen Elizabeth dog            
Levels: banana apple cat dog Queen Elizabeth

forcats

Common tasks

  • Turn character variables into factors.

  • Make factors by discretizing numeric variables.

  • Rename or reorder the levels of an existing factor.

Note

The packages forcats (“for categoricals”) helps wrangle categorical variables.

  • forcats loads with tidyverse!

Creating a Factor – forcats

word_data <- tibble(word = c("apple", "dog", "banana", "cat",
                             "banana", "Queen Elizabeth", "dog")
           )
word_data |> 
  pull(word)
[1] "apple"           "dog"             "banana"          "cat"            
[5] "banana"          "Queen Elizabeth" "dog"            
word_data |> 
  mutate(word = fct(x = word)) |> 
  pull(word)
[1] apple           dog             banana          cat            
[5] banana          Queen Elizabeth dog            
Levels: apple dog banana cat Queen Elizabeth
  • With forcats, the levels are automatically ordered in the order of first appearance.

Creating a Factor – forcats

You can still specify the order of the levels with the level argument.

word_data |> 
  mutate(word = fct(x = word, 
                    levels = c("apple", "banana", "cat",
                               "dog", "Queen Elizabeth")
                    )
         ) |> 
  pull(word)
[1] apple           dog             banana          cat            
[5] banana          Queen Elizabeth dog            
Levels: apple banana cat dog Queen Elizabeth

You can also specify non-present levels.

word_data |>
  mutate(word = fct(x = word, 
                    levels = c("apple", "banana", "cat", "dog",
                               "Queen Elizabeth", "King Charles")
                    )
         ) |> 
  pull(word)
[1] apple           dog             banana          cat            
[5] banana          Queen Elizabeth dog            
Levels: apple banana cat dog Queen Elizabeth King Charles

Re-coding a Factor – fct_recode

Change the existing levels with the syntax:

<new level> = <old level>.

word_data |> 
  mutate(word = fct_recode(.f = word,
                           "fruit" = "apple",
                           "fruit" = "banana",
                           "pet"   = "cat",
                           "pet"   = "dog")
         ) |> 
  pull(word)
[1] fruit           pet             fruit           pet            
[5] fruit           Queen Elizabeth pet            
Levels: fruit pet Queen Elizabeth

Note

Queen Elizabeth is a “remaining” level that is not recoded.

Collapsing a Factor – fct_collapse

Collapse the existing levels with the syntax:

<new level> = c(<old levels>).

word_data |> 
  mutate(word = fct_collapse(.f = word,
                           "fruit" = c("apple", "banana"),
                           "pet"   = c("cat", "dog")
                           )
         ) |> 
  pull(word)
[1] fruit           pet             fruit           pet            
[5] fruit           Queen Elizabeth pet            
Levels: fruit pet Queen Elizabeth

Note

Queen Elizabeth is a “remaining” level that is not recoded.

Re-leveling a Factor –fct_relevel

Change the order of the levels of an existing factor.

word_data |>
  pull(word) |> 
  levels()
[1] "fruit"           "pet"             "Queen Elizabeth"
word_data |> 
  mutate(word = fct_relevel(.f = word,
                            c("Queen Elizabeth", "pet", "fruit")
                            )
         ) |> 
  pull(word) |> 
  levels()
[1] "Queen Elizabeth" "pet"             "fruit"          

Unspecified levels will be in the same order at the end of the list.

word_data |> 
  mutate(word = fct_relevel(.f = word, 
                            "pet")
         ) |> 
  pull(word) |> 
  levels()
[1] "pet"             "fruit"           "Queen Elizabeth"

Re-coding a Factor – fct_recode

library(liver)
data(cereal)
str(cereal$manuf)
 Factor w/ 7 levels "A","G","K","N",..: 4 6 3 3 7 2 3 2 7 5 ...
cereal |> 
  mutate(manuf = case_when(manuf == "A" ~ "American Home Food Products", 
                           manuf == "G" ~ "General Mills", 
                           manuf == "K" ~ "Kelloggs", 
                           manuf == "N" ~ "Nabisco", 
                           manuf == "P" ~ "Post", 
                           manuf == "Q" ~ "Quaker Oats", 
                           manuf == "R" ~ "Ralston Purina"),
         manuf = as.factor(manuf)) |> 
  select(manuf) |> 
  summary()
                         manuf   
 American Home Food Products: 1  
 General Mills              :22  
 Kelloggs                   :23  
 Nabisco                    : 6  
 Post                       : 9  
 Quaker Oats                : 8  
 Ralston Purina             : 8  
cereal |> 
  mutate(manuf = fct_recode(manuf, 
                             "American Home Food Products" = "A", 
                             "General Mills" = "G", 
                             "Kelloggs" = "K", 
                             "Nabisco" = "N", 
                             "Post" = "P", 
                             "Quaker Oats" = "Q", 
                             "Ralston Purina" = "R")) |> 
  select(manuf) |> 
  summary()
                         manuf   
 American Home Food Products: 1  
 General Mills              :22  
 Kelloggs                   :23  
 Nabisco                    : 6  
 Post                       : 9  
 Quaker Oats                : 8  
 Ralston Purina             : 8  

Re-ordering Factors in ggplot2

library(ggridges)
cereal_recode |> 
  ggplot(aes(x = sugars, 
             y = manuf, 
             fill = manuf)) +
  geom_density_ridges() +
  theme_minimal() +
  theme(legend.position = "none")

cereal_recode |> 
  ggplot(aes(x = sugars, 
             y = fct_reorder(.f = manuf, 
                             .x = sugars,
                             .fun = mean), 
             fill = manuf)) +
  geom_density_ridges() +
  theme_minimal() +
  theme(legend.position = "none")

We can reorder factor levels based on a summary value.

  • Note: default reordering function is the median.

Re-ordering Factors in ggplot2

cereal_recode |> 
  group_by(manuf, shelf) |> 
  summarise(avg_sugar = mean(sugars, na.rm = TRUE)) |> 
  ggplot(aes(x = shelf, 
             y = avg_sugar, 
             color = manuf)) +
  geom_line(size = 1.5) +
  theme_minimal() +
  labs(color = "Manuf.")

The order of the legend follows the order of the factor levels.

cereal_recode |> 
  group_by(manuf, shelf) |> 
  summarise(avg_sugar = mean(sugars, na.rm = TRUE)) |> 
  ggplot(aes(x = shelf, 
             y = avg_sugar, 
             color = fct_reorder2(manuf, .x = shelf, .y = avg_sugar))) +
  geom_line(size = 1.5) +
  theme_minimal() +
  labs(color = "Manuf.")

We can reorder factor levels based on the y values associated with the largest x values.

  • The legend order will match the line order on the far right.

Glamour of Graphics

Data Layouts

Tidy Data

Tidy data…

  • is rectangular.
  • has observations as rows and variables as columns.
  • has different formats for different tasks.

R4DS

Consequences of Messy Data

Illustration by Allison Horst

  • Tidy: use the same tools in similar ways for different datasets.
  • Messy: create unique tools that are difficult to generalize.

Creating Tidy Data

We may need to transform our data to turn it into the version of tidy that is best for a task at hand.

Illustration by Allison Horst

Creating Tidy Data

We want to look at mean cereal nutrients based on shelf.

  • The data are in a wide format – a separate column for each nutrient.
  • Transforming the data will make plotting easier.
library(liver)
data(cereal)
head(cereal)
name manuf type calories protein fat sodium fiber carbo sugars potass vitamins shelf weight cups rating
100% Bran N cold 70 4 1 130 10.0 5.0 6 280 25 3 1 0.33 68.40297
100% Natural Bran Q cold 120 3 5 15 2.0 8.0 8 135 0 3 1 1.00 33.98368
All-Bran K cold 70 4 1 260 9.0 7.0 5 320 25 3 1 0.33 59.42551
All-Bran with Extra Fiber K cold 50 4 0 140 14.0 8.0 0 330 25 3 1 0.50 93.70491
Almond Delight R cold 110 2 2 200 1.0 14.0 8 -1 25 3 1 0.75 34.38484
Apple Cinnamon Cheerios G cold 110 2 2 180 1.5 10.5 10 70 25 1 1 0.75 29.50954

Creating Tidy Data

Code
cereal_wide <- cereal |> 
  group_by(shelf) |> 
  summarise(across(calories:vitamins, mean))
shelf calories protein fat sodium fiber carbo sugars potass vitamins
1 102.5000 2.650000 0.60 176.2500 1.6850000 15.80000 4.800000 75.50000 20.00000
2 109.5238 1.904762 1.00 145.7143 0.9047619 13.61905 9.619048 57.80952 23.80952
3 107.7778 2.861111 1.25 158.6111 3.1388889 14.50000 6.527778 129.83333 35.41667
Code
my_colors <- c("calories_col" = "steelblue", "sugars_col" = "orange3")

cereal_wide |> 
  ggplot() +
  geom_point(aes(x = shelf, y = calories, color = "calories_col")) +
  geom_line(aes(x = shelf, y = calories, color = "calories_col")) + 
  geom_point(aes(x = shelf, y = sugars, color = "sugars_col")) +
  geom_line(aes(x = shelf, y = sugars, color = "sugars_col")) +
  scale_color_manual(values = my_colors, labels = names(my_colors)) +
  labs(x = "Shelf", y = "", subtitle = "Mean Amount", color = "Nutrient")

Code
cereal_long<- cereal |> 
  pivot_longer(cols = calories:vitamins,
               names_to = "Nutrient",
               values_to = "Amount") |> 
  group_by(shelf, Nutrient) |> 
  summarise(mean_amount = mean(Amount))
shelf Nutrient mean_amount
1 calories 102.5000000
1 carbo 15.8000000
1 fat 0.6000000
1 fiber 1.6850000
1 potass 75.5000000
1 protein 2.6500000
1 sodium 176.2500000
1 sugars 4.8000000
1 vitamins 20.0000000
2 calories 109.5238095
2 carbo 13.6190476
2 fat 1.0000000
2 fiber 0.9047619
2 potass 57.8095238
2 protein 1.9047619
2 sodium 145.7142857
2 sugars 9.6190476
2 vitamins 23.8095238
3 calories 107.7777778
3 carbo 14.5000000
3 fat 1.2500000
3 fiber 3.1388889
3 potass 129.8333333
3 protein 2.8611111
3 sodium 158.6111111
3 sugars 6.5277778
3 vitamins 35.4166667
Code
cereal_long |> 
  ggplot(aes(x = shelf, 
             y = mean_amount, 
             color = Nutrient)) +
  geom_point() +
  geom_line() +
  labs(x = "Shelf", y = "", subtitle = "Mean Amount")

Pivoting Data

Tidyexpalin animation by Kelsey Gonzalez

Manual Method

Consider daily rainfall observed in SLO in January 2023.

  • The data is in a human-friendly form (like a calendar).
  • Each week has a row, and each day has a column.

Data source

How would you manually convert this to long format?

Manual Method: Steps

  1. Create a new column: Day_of_Week.
  2. Create a new column: Rainfall (hold daily rainfall values).
  3. Now we have three columns (Week, Day_of_Week, and Rainfall) – start moving Sunday values over.
  4. Duplicate Week 1-5 and copy Monday values over.
  5. Duplicate Week 1-5 and copy Tuesday values over.
  6. Continue for the rest of the days of the week.
  7. You may want to arrange() by Week to get the rainfall values chronological order.

Computational Approach

We can use pivot_longer() to turn a wide dataset into a long(er) dataset.

pivot_longer()

Take a wide dataset and turn it into a long daaset.

  • cols – specify the columns that should be pivoted.
    • Do not include the names of ID columns (columns to not be pivoted).
  • names_to – the name of the new column containing the old column names.
  • values_to – the name of the new column containing the old column values.

pivot_longer()

library(readxl)
slo_rainfall <- read_xlsx("data/2023-rainfall-slo.xlsx")

slo_rainfall |> 
  mutate(across(Sunday:Saturday, as.numeric)) |> 
  pivot_longer(cols      = Sunday:Saturday,
               names_to  = "Day_of_Week",
               values_to = "Daily_Rainfall")
Week Day_of_Week Daily_Rainfall
1 Sunday 0.00
1 Monday 0.12
1 Tuesday 0.00
1 Wednesday 1.58
1 Thursday 0.91
1 Friday 0.00
1 Saturday 0.05
2 Sunday 0.27
2 Monday 4.26
2 Tuesday 0.43
2 Wednesday 0.00
2 Thursday 0.00
2 Friday 0.16
2 Saturday 1.41
3 Sunday 0.34
3 Monday 0.33
3 Tuesday 0.00
3 Wednesday 0.00
3 Thursday 0.13
3 Friday 0.00
3 Saturday 0.00
4 Sunday 0.00
4 Monday 0.00
4 Tuesday 0.00
4 Wednesday 0.00
4 Thursday 0.00
4 Friday 0.00
4 Saturday NA
5 Sunday NA
5 Monday NA
5 Tuesday NA
5 Wednesday NA
5 Thursday NA
5 Friday NA
5 Saturday NA

pivot_wider()

Take a long dataset and turn it into a wide daaset.

  • id_cols – specify the column(s) that contain the ID for unique rows in the wide dataset.
  • names_from – the name of the column containing the new column names.
  • values_from – the name of the column containing the new column values.

pivot_wider()

We calculate the mean amount of protein for cereals on each shelf and for each manufacturer.

mean_protein <- cereal |> 
  group_by(manuf, shelf) |> 
  summarize(mean_protein = mean(protein))
manuf shelf mean_protein
A 2 4.000000
G 1 3.000000
G 2 1.285714
G 3 2.666667
K 1 2.750000
K 2 2.142857
K 3 2.916667
N 1 2.666667
N 2 2.500000
N 3 4.000000
P 1 1.500000
P 2 1.000000
P 3 3.000000
Q 1 5.000000
Q 2 2.000000
Q 3 2.500000
R 1 2.000000
R 3 3.000000

pivot_wider()

mean_protein |> 
  arrange(shelf) |> 
  pivot_wider(id_cols = manuf,
              names_from = shelf,
              values_from = mean_protein)
manuf 1 2 3
G 3.000000 1.285714 2.666667
K 2.750000 2.142857 2.916667
N 2.666667 2.500000 4.000000
P 1.500000 1.000000 3.000000
Q 5.000000 2.000000 2.500000
R 2.000000 NA 3.000000
A NA 4.000000 NA

Better names in pivot_wider()

mean_protein |> 
  arrange(shelf) |> 
  pivot_wider(id_cols = manuf,
              names_from = shelf,
              values_from = mean_protein,
              names_prefix = "Shelf_")
manuf Shelf_1 Shelf_2 Shelf_3
G 3.000000 1.285714 2.666667
K 2.750000 2.142857 2.916667
N 2.666667 2.500000 4.000000
P 1.500000 1.000000 3.000000
Q 5.000000 2.000000 2.500000
R 2.000000 NA 3.000000
A NA 4.000000 NA

Data Joins

Relational Data

Multiple, interconnected tables of data are called relational.

  • It is the relation between data sets, not just the individual data sets themselves, that are important.

IMDb movie relational data

Data Joins

We can combine (join) data tables based on their relations.

Mutating joins

Add variables from a new dataframe to observations in an existing dataframe.

full_join(), left_join(), right_join(), inner_join(), outer_join()

Filtering Joins

Filter observations based on values in new dataframe.

semi_join(), anti_join()

Keys

A key uniquely identifies an observation in a data set.

  • To combine (join) two datasets, the key needs to be present in both.

inner_join()

Keeps obsertvations when their keys are present in both datasets.

inner_join(): IMDb Example

directors_genres
director_id genre prob
429 Adventure 0.750000
429 Fantasy 0.750000
2931 Drama 0.714286
2931 Action 0.428571
11652 Sci-Fi 0.500000
11652 Action 0.500000
14927 Animation 1.000000
14927 Family 1.000000
15092 Comedy 0.545455
15092 Crime 0.545455
movies_directors
director_id movie_id
429 300229
9247 124110
11652 10920
11652 333856
14927 192017
15092 109093
15092 237431

ID: 429, 2931, 11652, 14927, 15092       ID: 429, 9247, 11652, 14927, 15092

inner_join(directors_genres, movies_directors)
director_id genre prob movie_id
429 Adventure 0.750000 300229
429 Fantasy 0.750000 300229
11652 Sci-Fi 0.500000 10920
11652 Sci-Fi 0.500000 333856
11652 Action 0.500000 10920
11652 Action 0.500000 333856
14927 Animation 1.000000 192017
14927 Family 1.000000 192017
15092 Comedy 0.545455 109093
15092 Comedy 0.545455 237431
15092 Crime 0.545455 109093
15092 Crime 0.545455 237431

ID: 429, 2931, 9247, 11652, 14927, 15092

inner_join(): IMDb Example

What if our key does not have the same name?

directors_genres
director_id genre prob
429 Adventure 0.750000
429 Fantasy 0.750000
2931 Drama 0.714286
2931 Action 0.428571
11652 Sci-Fi 0.500000
11652 Action 0.500000
14927 Animation 1.000000
14927 Family 1.000000
15092 Comedy 0.545455
15092 Crime 0.545455
directors
id first_name last_name
429 Andrew Adamson
9247 Zach Braff
11652 James (I) Cameron
14927 Ron Clements
15092 Ethan Coen
inner_join(directors_genres, 
           directors, 
           by = join_by(director_id == id))
id first_name last_name genre prob
429 Andrew Adamson Adventure 0.750000
429 Andrew Adamson Fantasy 0.750000
11652 James (I) Cameron Sci-Fi 0.500000
11652 James (I) Cameron Action 0.500000
14927 Ron Clements Animation 1.000000
14927 Ron Clements Family 1.000000
15092 Ethan Coen Comedy 0.545455
15092 Ethan Coen Crime 0.545455

Join by different variables on dataX and dataY: join_by(a == b) will match dataX$a to dataY$b.

Piping Joins

Remember: the dataset you pipe in becomes the first argument of the function you are piping into!

  • So if you are using a pipe, you will only be specifying the right dataset inside the join function.
inner_join(directors_genres, movies_directors)

…is equivalent to…

directors_genres |> 
  inner_join(movies_directors)

Mutating Joins

  • left_join() – keep only (and all) observations in the left data set

  • right_join() – keep only (and all) observations in the right data set

  • full_join() – keep all observations in both data sets

Mutating Joins

Which directors would remain for each of the following?

  • left_join(directors_genres, movies_directors)
  • right_join(directors_genres, movies_directors)
  • full_join(directors_genres, movies_directors)

directors_genres |> 
  distinct(director_id)
director_id
429
2931
11652
14927
15092
movies_directors |> 
  distinct(director_id)
director_id
429
9247
11652
14927
15092

Filtering Joins: semi_join()

Keeps observations when their keys are present in both datasets, but only keeps variables from the first dataset.


→  

Filtering Joins: semi_join()

directors_genres |> 
  semi_join(movies_directors)
director_id genre prob
429 Adventure 0.750000
429 Fantasy 0.750000
11652 Sci-Fi 0.500000
11652 Action 0.500000
14927 Animation 1.000000
14927 Family 1.000000
15092 Comedy 0.545455
15092 Crime 0.545455

Movie Directors: 429, 2931, 11652, 14927, 15092

directors_genres |>
  filter(director_id %in% movies_directors$director_id)
director_id genre prob
429 Adventure 0.750000
429 Fantasy 0.750000
11652 Sci-Fi 0.500000
11652 Action 0.500000
14927 Animation 1.000000
14927 Family 1.000000
15092 Comedy 0.545455
15092 Crime 0.545455

Filtering Joins: anti_join()

Removes observations when their keys are present in both datasets, and only keeps variables from the first dataset.


→  


Filtering Joins: anti_join()

directors_genres |> 
  anti_join(movies_directors)
director_id genre prob
2931 Drama 0.714286
2931 Action 0.428571

Movie Directors: 429, 2931, 11652, 14927, 15092

directors_genres |>
  filter(!director_id %in% movies_directors$director_id)
director_id genre prob
2931 Drama 0.714286
2931 Action 0.428571

PA 4: Military Spending

Today you will be tidying messy data to explore the relationship between countries of the world and military spending.

To do…

  • PA 4: Military Spending
    • Due Thursday, 2/1 at 8:00am

Thursday, February 2

Today we will…

  • Review Lab 4: Teacher Evaluations
  • Miscellaneous…
    • Clean Variable Names
    • Lifecycle Stages
  • Extensions to Relational Data
  • Lab 4: Childcare Costs

Lab 3: Teacher Evaluations

  • The tidyverse package automatically loads ggplot2, dplyr, readr, etc. – do not load these twice!

  • Do not use mutate() + distinct(), when you can use summarize()!

mutate() vs summarise()

Game Plan: Rate My Professor

Clean Variable Names with janitor

Data from external sources likely has variable names not ideally formatted for R.

Names may…

  • contain spaces.
  • start with numbers.
  • start with a mix of capital and lower case letters.
names(military)[1:12]
 [1] "Country"        "Notes"          "Reporting year" "1988"          
 [5] "1989"           "1990"           "1991"           "1992"          
 [9] "1993"           "1994"           "1995"           "1996"          

Clean Variable Names with janitor

The janitor package converts all variable names in a dataset to snake_case.

Names will…

  • start with a lower case letter.
  • have spaces filled in with _.
library(janitor)
military_clean_names <- military |> 
  clean_names()

names(military_clean_names)[1:12]
 [1] "country"        "notes"          "reporting_year" "x1988"         
 [5] "x1989"          "x1990"          "x1991"          "x1992"         
 [9] "x1993"          "x1994"          "x1995"          "x1996"         

Lifceycle Stages

As packages get updated, the functions and function arguments included in those packages will change.

  • The accepted syntax for a function may change.
  • A function/functionality may disappear.

Learn more about lifecycle stages of packages, functions, function arguments in R.

Deprecated Functions

A deprecated functionality has a better alternative available and is scheduled for removal.

  • You get a warning telling you what to use instead.
military_clean |> 
  filter(across(Notes:`2019`, is.na)) |> 
  slice_head(n = 3) |> 
  select(1:8)
Warning: Using `across()` in `filter()` was deprecated in dplyr 1.0.8.
ℹ Please use `if_any()` or `if_all()` instead.
# A tibble: 3 × 8
  Country      Notes `Reporting year` `1988` `1989` `1990` `1991` `1992`
  <chr>        <chr> <chr>            <chr>  <chr>  <chr>  <chr>  <chr> 
1 Africa       <NA>  <NA>             <NA>   <NA>   <NA>   <NA>   <NA>  
2 North Africa <NA>  <NA>             <NA>   <NA>   <NA>   <NA>   <NA>  
3 Sub-Saharan  <NA>  <NA>             <NA>   <NA>   <NA>   <NA>   <NA>  

Deprecated Functions

You should not use deprecated functions!

Instead, we use…

military_clean |>
  filter(if_all(Notes:`2019`, ~ is.na(.x))) |> 
  slice_head(n = 3) |> 
  select(1:8)
# A tibble: 3 × 8
  Country      Notes `Reporting year` `1988` `1989` `1990` `1991` `1992`
  <chr>        <chr> <chr>            <chr>  <chr>  <chr>  <chr>  <chr> 
1 Africa       <NA>  <NA>             <NA>   <NA>   <NA>   <NA>   <NA>  
2 North Africa <NA>  <NA>             <NA>   <NA>   <NA>   <NA>   <NA>  
3 Sub-Saharan  <NA>  <NA>             <NA>   <NA>   <NA>   <NA>   <NA>  

Superceded Functions

A superseded functionality has a better alternative, but is not going away .

  • This is a softer alternative to deprecation.
  • A superseded function will not give a warning (since there’s no risk if you keep using it), but the documentation will give you a recommendation.

Extensions to Relational Data

Relational Data

When we work with multiple tables of data, we say we are working with relational data.

  • It is the relations, not just the individual datasets, that are important.

When we work with relational data, we rely on keys.

  • A key uniquely identifies an observation in a dataset.
  • A key allows us to relate datasets to each other

IMDb Movies Data

How can we find each director’s active years?

Joining Multiple Data Sets

directors[1:4,]
# A tibble: 4 × 3
     id first_name last_name
  <dbl> <chr>      <chr>    
1   429 Andrew     Adamson  
2  2931 Darren     Aronofsky
3  9247 Zach       Braff    
4 11652 James (I)  Cameron  
movies_directors[1:4,]
# A tibble: 4 × 2
  director_id movie_id
        <dbl>    <dbl>
1         429   300229
2        2931   254943
3        9247   124110
4       11652    10920
movies[1:4,]
# A tibble: 4 × 4
     id name           year  rank
  <dbl> <chr>         <dbl> <dbl>
1 10920 Aliens         1986  8.20
2 17173 Animal House   1978  7.5 
3 18979 Apollo 13      1995  7.5 
4 30959 Batman Begins  2005 NA   

movies_directors |> 
  inner_join(directors, 
             join_by(director_id == id))
director_id movie_id first_name last_name
429 300229 Andrew Adamson
2931 254943 Darren Aronofsky
9247 124110 Zach Braff
11652 10920 James (I) Cameron
11652 333856 James (I) Cameron
14927 192017 Ron Clements
15092 109093 Ethan Coen
15092 237431 Ethan Coen
15093 109093 Joel Coen
15093 237431 Joel Coen
15901 130128 Francis Ford Coppola
15906 194874 Sofia Coppola
16816 350424 Cameron Crowe
17810 297838 Frank Darabont
22104 224842 Clint Eastwood
24758 112290 David Fincher
28395 46169 Mel (I) Gibson
35573 18979 Ron Howard
35838 257264 John (I) Hughes
37872 300229 Vicky Jenson
38746 238695 Mike (I) Judge
41975 314965 David Koepp
44291 17173 John (I) Landis
46315 344203 Jay Levey
48115 313459 George Lucas
56332 192017 John Musker
58201 30959 Christopher Nolan
58201 210511 Christopher Nolan
65940 111813 Rob Reiner
66849 306032 Guy Ritchie
68161 116907 Herbert (I) Ross
74758 238072 Steven Soderbergh
76524 167324 Oliver (I) Stone
78273 176711 Quentin Tarantino
78273 176712 Quentin Tarantino
78273 267038 Quentin Tarantino
78273 276217 Quentin Tarantino
82525 147603 Paul (I) Verhoeven
83616 207992 Andy Wachowski
83617 207992 Larry Wachowski
88802 256630 Unknown Director
movies_directors |> 
  inner_join(directors, 
             join_by(director_id == id)) |> 
  inner_join(movies,
             join_by(movie_id == id)) |> 
  rename(movie_name = name)
director_id movie_id first_name last_name movie_name year rank
429 300229 Andrew Adamson Shrek 2001 8.1
2931 254943 Darren Aronofsky Pi 1998 7.5
9247 124110 Zach Braff Garden State 2004 8.3
11652 10920 James (I) Cameron Aliens 1986 8.2
11652 333856 James (I) Cameron Titanic 1997 6.9
14927 192017 Ron Clements Little Mermaid, The 1989 7.3
15092 109093 Ethan Coen Fargo 1996 8.2
15092 237431 Ethan Coen O Brother, Where Art Thou? 2000 7.8
15093 109093 Joel Coen Fargo 1996 8.2
15093 237431 Joel Coen O Brother, Where Art Thou? 2000 7.8
15901 130128 Francis Ford Coppola Godfather, The 1972 9.0
15906 194874 Sofia Coppola Lost in Translation 2003 8.0
16816 350424 Cameron Crowe Vanilla Sky 2001 6.9
17810 297838 Frank Darabont Shawshank Redemption, The 1994 9.0
22104 224842 Clint Eastwood Mystic River 2003 8.1
24758 112290 David Fincher Fight Club 1999 8.5
28395 46169 Mel (I) Gibson Braveheart 1995 8.3
35573 18979 Ron Howard Apollo 13 1995 7.5
35838 257264 John (I) Hughes Planes, Trains & Automobiles 1987 7.2
37872 300229 Vicky Jenson Shrek 2001 8.1
38746 238695 Mike (I) Judge Office Space 1999 7.6
41975 314965 David Koepp Stir of Echoes 1999 7.0
44291 17173 John (I) Landis Animal House 1978 7.5
46315 344203 Jay Levey UHF 1989 6.6
48115 313459 George Lucas Star Wars 1977 8.8
56332 192017 John Musker Little Mermaid, The 1989 7.3
58201 30959 Christopher Nolan Batman Begins 2005 NA
58201 210511 Christopher Nolan Memento 2000 8.7
65940 111813 Rob Reiner Few Good Men, A 1992 7.5
66849 306032 Guy Ritchie Snatch. 2000 7.9
68161 116907 Herbert (I) Ross Footloose 1984 5.8
74758 238072 Steven Soderbergh Ocean's Eleven 2001 7.5
76524 167324 Oliver (I) Stone JFK 1991 7.8
78273 176711 Quentin Tarantino Kill Bill: Vol. 1 2003 8.4
78273 176712 Quentin Tarantino Kill Bill: Vol. 2 2004 8.2
78273 267038 Quentin Tarantino Pulp Fiction 1994 8.7
78273 276217 Quentin Tarantino Reservoir Dogs 1992 8.3
82525 147603 Paul (I) Verhoeven Hollow Man 2000 5.3
83616 207992 Andy Wachowski Matrix, The 1999 8.5
83617 207992 Larry Wachowski Matrix, The 1999 8.5
88802 256630 Unknown Director Pirates of the Caribbean 2003 NA

Joining on Multiple Variables

Consider the rodent data from Lab 2.

  • We want to add species_id to the rodent measurements.
species
genus species taxa species_id
Dipodomys merriami Rodent DM
Dipodomys ordii Rodent DO
Perognathus flavus Rodent PF
Chaetodipus penicillatus Rodent PP
Peromyscus eremicus Rodent PE
Onychomys leucogaster Rodent OL
Reithrodontomys megalotis Rodent RM
Dipodomys spectabilis Rodent DS
Onychomys torridus Rodent OT
Neotoma albigula Rodent NL
Peromyscus maniculatus Rodent PM
Sigmodon hispidus Rodent SH
Reithrodontomys fulvescens Rodent RF
Chaetodipus baileyi Rodent PB
measurements
genus_name species sex hindfoot_length weight
Dipodomys merriami M 35 40
Dipodomys merriami M 37 48
Dipodomys merriami F 34 29
Dipodomys merriami F 35 46
Dipodomys merriami M 35 36
Dipodomys ordii F 32 52
Perognathus flavus M 15 8
Dipodomys merriami F 36 35
Perognathus flavus M 12 7
Dipodomys merriami F 32 22
Perognathus flavus M 16 9
Dipodomys merriami F 34 42
Perognathus flavus F 14 8
Dipodomys merriami F 35 41
Dipodomys merriami F 37 37
Dipodomys merriami F 35 43
Dipodomys merriami F 35 41
Dipodomys merriami F 33 40
Perognathus flavus F 11 9
Dipodomys merriami F 35 45
Chaetodipus penicillatus F 20 15
Dipodomys merriami M 35 29
Dipodomys merriami M 35 39
Dipodomys merriami F 36 43
Dipodomys merriami M 38 46
Dipodomys merriami M 36 41
Dipodomys merriami M 36 41
Dipodomys merriami M 38 40
Dipodomys merriami M 37 45
Dipodomys merriami F 35 46
Dipodomys merriami F 35 40
Dipodomys merriami F 35 30
Dipodomys merriami M 35 39
Dipodomys merriami M 35 34
Dipodomys merriami F 37 42
Dipodomys merriami M 37 42
Perognathus flavus F 13 8
Dipodomys merriami F 37 31
Dipodomys merriami F 36 40
Dipodomys merriami M 36 37
Dipodomys merriami M 36 48
Dipodomys merriami M 37 42
Dipodomys merriami F 39 45
Chaetodipus penicillatus F 21 16
Dipodomys merriami F 36 36
Dipodomys merriami M 36 42
Dipodomys merriami M 36 44
Dipodomys merriami F 36 41
Dipodomys merriami F 36 40
Dipodomys merriami M 37 34
Dipodomys merriami M 33 40
Dipodomys merriami M 33 44
Dipodomys merriami M 37 44
Dipodomys merriami M 34 36
Dipodomys merriami M 35 33
Dipodomys merriami F 37 46
Dipodomys merriami F 34 35
Dipodomys merriami M 36 46
Dipodomys merriami F 33 37
Dipodomys merriami M 36 34
Dipodomys merriami F 36 45
Perognathus flavus F 15 7
Dipodomys merriami M 37 51
Dipodomys merriami M 35 39
Dipodomys merriami M 36 29
Dipodomys merriami F 32 48
Dipodomys merriami M 38 46
Dipodomys merriami F 37 41
Dipodomys merriami M 37 45
Dipodomys merriami F 35 42
Dipodomys merriami F 36 53
Dipodomys merriami F 35 49
Dipodomys merriami F 36 46
Perognathus flavus F 13 9
Chaetodipus penicillatus F 19 15
Perognathus flavus M 13 4
Dipodomys merriami M 36 48
Dipodomys merriami M 37 51
Dipodomys merriami M 38 50
Dipodomys merriami M 35 44
Dipodomys merriami M 25 44
Dipodomys merriami M 35 45
Dipodomys merriami F 37 45
Peromyscus eremicus M 20 19
Dipodomys merriami F 38 44
Dipodomys merriami F 36 42
Dipodomys merriami M 37 39
Dipodomys merriami M 37 47
Dipodomys merriami M 36 42
Dipodomys merriami M 36 49
Dipodomys merriami M 38 39
Dipodomys merriami F 36 43
Dipodomys merriami M 35 50
Dipodomys merriami M 36 41
Dipodomys merriami M 37 47
Dipodomys merriami F 36 37
Dipodomys merriami M 36 41
Dipodomys merriami F 36 36
Dipodomys merriami M 36 45
Peromyscus eremicus M 19 20
species |> 
  full_join(measurements,
            join_by(species == species, 
                    genus == genus_name))
genus species taxa species_id sex hindfoot_length weight
Dipodomys merriami Rodent DM M 35 40
Dipodomys merriami Rodent DM M 37 48
Dipodomys merriami Rodent DM F 34 29
Dipodomys merriami Rodent DM F 35 46
Dipodomys merriami Rodent DM M 35 36
Dipodomys merriami Rodent DM F 36 35
Dipodomys merriami Rodent DM F 32 22
Dipodomys merriami Rodent DM F 34 42
Dipodomys merriami Rodent DM F 35 41
Dipodomys merriami Rodent DM F 37 37
Dipodomys merriami Rodent DM F 35 43
Dipodomys merriami Rodent DM F 35 41
Dipodomys merriami Rodent DM F 33 40
Dipodomys merriami Rodent DM F 35 45
Dipodomys merriami Rodent DM M 35 29
Dipodomys merriami Rodent DM M 35 39
Dipodomys merriami Rodent DM F 36 43
Dipodomys merriami Rodent DM M 38 46
Dipodomys merriami Rodent DM M 36 41
Dipodomys merriami Rodent DM M 36 41
Dipodomys merriami Rodent DM M 38 40
Dipodomys merriami Rodent DM M 37 45
Dipodomys merriami Rodent DM F 35 46
Dipodomys merriami Rodent DM F 35 40
Dipodomys merriami Rodent DM F 35 30
Dipodomys merriami Rodent DM M 35 39
Dipodomys merriami Rodent DM M 35 34
Dipodomys merriami Rodent DM F 37 42
Dipodomys merriami Rodent DM M 37 42
Dipodomys merriami Rodent DM F 37 31
Dipodomys merriami Rodent DM F 36 40
Dipodomys merriami Rodent DM M 36 37
Dipodomys merriami Rodent DM M 36 48
Dipodomys merriami Rodent DM M 37 42
Dipodomys merriami Rodent DM F 39 45
Dipodomys merriami Rodent DM F 36 36
Dipodomys merriami Rodent DM M 36 42
Dipodomys merriami Rodent DM M 36 44
Dipodomys merriami Rodent DM F 36 41
Dipodomys merriami Rodent DM F 36 40
Dipodomys merriami Rodent DM M 37 34
Dipodomys merriami Rodent DM M 33 40
Dipodomys merriami Rodent DM M 33 44
Dipodomys merriami Rodent DM M 37 44
Dipodomys merriami Rodent DM M 34 36
Dipodomys merriami Rodent DM M 35 33
Dipodomys merriami Rodent DM F 37 46
Dipodomys merriami Rodent DM F 34 35
Dipodomys merriami Rodent DM M 36 46
Dipodomys merriami Rodent DM F 33 37
Dipodomys merriami Rodent DM M 36 34
Dipodomys merriami Rodent DM F 36 45
Dipodomys merriami Rodent DM M 37 51
Dipodomys merriami Rodent DM M 35 39
Dipodomys merriami Rodent DM M 36 29
Dipodomys merriami Rodent DM F 32 48
Dipodomys merriami Rodent DM M 38 46
Dipodomys merriami Rodent DM F 37 41
Dipodomys merriami Rodent DM M 37 45
Dipodomys merriami Rodent DM F 35 42
Dipodomys merriami Rodent DM F 36 53
Dipodomys merriami Rodent DM F 35 49
Dipodomys merriami Rodent DM F 36 46
Dipodomys merriami Rodent DM M 36 48
Dipodomys merriami Rodent DM M 37 51
Dipodomys merriami Rodent DM M 38 50
Dipodomys merriami Rodent DM M 35 44
Dipodomys merriami Rodent DM M 25 44
Dipodomys merriami Rodent DM M 35 45
Dipodomys merriami Rodent DM F 37 45
Dipodomys merriami Rodent DM F 38 44
Dipodomys merriami Rodent DM F 36 42
Dipodomys merriami Rodent DM M 37 39
Dipodomys merriami Rodent DM M 37 47
Dipodomys merriami Rodent DM M 36 42
Dipodomys merriami Rodent DM M 36 49
Dipodomys merriami Rodent DM M 38 39
Dipodomys merriami Rodent DM F 36 43
Dipodomys merriami Rodent DM M 35 50
Dipodomys merriami Rodent DM M 36 41
Dipodomys merriami Rodent DM M 37 47
Dipodomys merriami Rodent DM F 36 37
Dipodomys merriami Rodent DM M 36 41
Dipodomys merriami Rodent DM F 36 36
Dipodomys merriami Rodent DM M 36 45
Dipodomys ordii Rodent DO F 32 52
Perognathus flavus Rodent PF M 15 8
Perognathus flavus Rodent PF M 12 7
Perognathus flavus Rodent PF M 16 9
Perognathus flavus Rodent PF F 14 8
Perognathus flavus Rodent PF F 11 9
Perognathus flavus Rodent PF F 13 8
Perognathus flavus Rodent PF F 15 7
Perognathus flavus Rodent PF F 13 9
Perognathus flavus Rodent PF M 13 4
Chaetodipus penicillatus Rodent PP F 20 15
Chaetodipus penicillatus Rodent PP F 21 16
Chaetodipus penicillatus Rodent PP F 19 15
Peromyscus eremicus Rodent PE M 20 19
Peromyscus eremicus Rodent PE M 19 20
Onychomys leucogaster Rodent OL NA NA NA
Reithrodontomys megalotis Rodent RM NA NA NA
Dipodomys spectabilis Rodent DS NA NA NA
Onychomys torridus Rodent OT NA NA NA
Neotoma albigula Rodent NL NA NA NA
Peromyscus maniculatus Rodent PM NA NA NA
Sigmodon hispidus Rodent SH NA NA NA
Reithrodontomys fulvescens Rodent RF NA NA NA
Chaetodipus baileyi Rodent PB NA NA NA

Lab 4: Teacher Evaluations


Filtering Joins

semi_join(): Keeps values found in another data set

anti_join(): Keeps values not found in another data set

Workflow

  1. Load packages + read in data.
library(tidyverse)
data_original <- read_csv(file = "path/to/data.csv")
  1. Clean data and save changes – use this data from now on!
data_clean <- data_original |> 
  mutate(across(x1:x5, ~ as.factor(.x)),
         new_var = if_else(...))
  1. Create subsets from your cleaned data for specific tasks.
demographics_subj <- data_clean |> 
  distinct(subj, keep_all = TRUE)
  1. Output only the the information you want in your report. Recall #| include: false and #| output: false

To do…

  • Lab 4: Childcare Costs
    • Due Monday, 2/5 at 11:59pm
  • Read Chapter 5: Special Data Types
    • Check-in 5.1 - 5.2 due Tuesday (2/6) at 8:00am